clean factorsSPGMI and stocksCRSP data sets #83

spinnj · 2022-03-25T23:24:00Z

factorsSPGMI and stocksCRSP had the following issues:

certain sectors incorrectly named (misspelled),
7 securities in the sample of 300 were in sectors that were to be removed (as per Doug Martin, requested no securities in Financials, Utilities, or Real Estate sectors),
per Doug Martin, the CapGroup assignments were incorrectly mapped (our comparison suggested about 75% accuracy) - we remapped CapGroup assignments on a point in time basis using an outside data set and the CRSP percentile breakpoints of 0.7, 0.85, 0.98 to divide Large, Mid, Small, and Micro cap stocks and also created a new variable "CapGroupL" which is the ending group assignment on 2015-12-31 for each stock,

Additionally, certain other changes were made at the request of Doug Martin:

The sector names were modified slightly for brevity,
"TickerLast" was renamed to "TickerL"

Incorporating this pull request would:

Modify factorsSPGMI and stocksCRSP as described (80868 rows vs. previously 82800) with new CapGroupL variable (1 extra column).
Add a folder within the Sandbox directory titled "Vestcor Data Cleaning" that has an R file with some rough code used to clean the data, and the MarketCapPercentiles.csv file historical data used to reclassify stocks CapGroup assignments.

We can be reached at [email protected] and [email protected] for comments/questions/concerns.

JustinMShea · 2022-03-26T17:32:36Z

@spinnj Thanks for your PR. It is nice to see folks contributing to the software who appear to be using both for work and teaching.

Concerning your comments above.

Additionally, certain other changes were made at the request of Doug Martin:
The sector names were modified slightly for brevity,
"TickerLast" was renamed to "TickerL"

I noticed that you changed some of the original source data names, which is highly inadvisable. This is something that Doug Martin insists on doing, and has ignored both Brian and I's experience and advice on the matter. Therefore, please revert the sector names to their original format asap.

Thank you,
Justin

spinnj · 2022-03-26T17:36:59Z

@spinnj Thanks for your PR. It is nice to see folks contributing to the software who appear to be using both for work and teaching.

Concerning your comments above.
Additionally, certain other changes were made at the request of Doug Martin:
The sector names were modified slightly for brevity,
"TickerLast" was renamed to "TickerL"
I noticed that you changed some of the original source data names, which is highly inadvisable. This is something that Doug Martin insists on doing, and has ignored both Brian and I's experience and advice on the matter. Therefore, please revert the sector names to their original format asap.

Thank you, Justin

OK. Checking to make sure I understand exactly what you want me to revert - is the concern the renaming of TickerLast column to TickerL or is it the modified Sectors (changing row contents for each security)?

braverock · 2022-03-26T17:51:32Z

I'm tired of trying to impose best practices, but varying from vendor names is almost never a good idea, because then the analyst (or student) can't look up what things mean online, and they can't use other people's code that assumes the vendor data will be unmolested.

If we want the parsed R objects to have different names, I guess that is arguably OK fropm a pedantic point of view, but even then, use factor labels, and provide a function to change the change the labels for the factor levels, don't change the source data.

Changing the names of the input data sources is frankly dumb, because then you can never go back to the original data source and reconcile. You can also never merge new data from the input data source. We've already seen merge problems when Doug wanted to get updated data from S&P or further adjusted things. If people who are supposedly highly familiar with the data set can't merge it cleanly, then how are other people going to feel if they download data from S&P?

In a production data pipeline, as I'm sure you and your team are aware, you usually ingest the source data as is, and then create processes for cleaning or labeling or fixing bad data points, outliers, etc. These things are supposed to be documented and repeatable. Your sandbox code is a start, but it isn't documented, and the documentation for the data sources in the package doesn't explain that the data has been altered, how it was altered, and where the code to apply or reverse the alterations may be found.

Both Justin and I have voiced concerns several times about renaming things. They won't match S&P's naming conventions or documentation, you can't merge new data, the list is too long to mention. Not to mentio, we are far beyond the days of fixed-width terminals, how does shortening a name make it more readable or usable? I would suggest it is far more common in practice to clean things like badly placed spaces or puntuation in names, or to provide printed names that are "pretty" and usually longer (like full company names instead of data identifiers).

Best practice is to create code to make changes, and apply that code in a systematic way. Even with students or a textbook, we should be teaching people how real data sources work, warts and all, not cleaning it all up, giving them artificially perfect toy examples, and then wondering why they fail when they get to industry and 80% of the work is understanding, parsing, and manipulating input data.

Like I said, I'm sick of arguing about it where it is clear that there is no intention to adhere to industry best practices. Our time is probably better spent fixing bugs and making improvements in the functions themselves (there are many, see the issues list).

spinnj · 2022-03-26T17:56:03Z

I haven't been part of any arguments and so I think I've come in to the middle of a conversation here.

Overall, understood, can we simply revert everything from my pull request and move forward in your preferred way - I'm not opposed to that in any way.

JustinMShea · 2022-03-26T18:16:48Z

OK. Checking to make sure I understand exactly what you want me to revert - is the concern the renaming of TickerLast column to TickerL or is it the modified Sectors (changing row contents for each security)?

Yes, @spinnj it is both the TickerLast to TickerL and modified Sectors labels. As Brian helpfully provided a longer explanation, original names are best for a variety of reasons.

spinnj · 2022-03-26T18:18:15Z

OK, understood. Best course of action may be to revert the whole thing, let me go back and keep things as they were and match the sector names to the GICS documentation, and I can submit another PR later?

JustinMShea · 2022-03-26T18:43:40Z

Done. Just to illustrate a related point to this conversations, I ran the R CMD check after the PR was merged and got a new error in the Examples files(See #43 on this issue).

Running examples in ‘FactorAnalytics-Ex.R’ failed
The error most likely occurred in:

> ### Name: fitFfm
> ### Title: Fit a fundamental factor model using cross-sectional regression
> ### Aliases: fitFfm coef.ffm fitted.ffm residuals.ffm
> 
> ### ** Examples
> 
> 
> # load data 
> data(stocksCRSP)
> data(factorsSPGMI)
> 
> stocks_factors <- selectCRSPandSPGMI(stocks = stocksCRSP, factors = factorsSPGMI,
+                                     dateSet = c("2006-01-31", "2010-12-31"), 
+                                     stockItems = c("Date", "TickerLast", 
+                                                    "CapGroup", "Sector", 
+                                                    "Return", "Ret13WkBill",
+                                                    "mktIndexCRSP"),
+                                     factorItems = c("BP", "LogMktCap", "SEV"), 
+                                     capChoice = "SmallCap",
+                                     Nstocks = 20)
Error in `[.data.table`(facModDat, , .SD, .SDcols = colNames) : 
  Some items of .SDcols are not column names: [TickerLast]
Calls: selectCRSPandSPGMI -> [ -> [.data.table
Execution halted

This is due to "TickerLast" being changed to "TickerL". It will impact a bunch of other files too, since we recycle many of the examples, but the Examples error only prints the first one.

The main goal is to get this on CRAN and then have it stay there for many years with minor maintenance. Thus we need to stamp out bugs and eliminate errors, warnings, and notes on build in the existing code base. We don't want to add others in the process if we can help it because we have a lot of other issues to stamp out (documented under the issues tab).

In addition, please see the project board here for the strategy I outlined to tackle existing issues and get this on CRAN. I believe your PR will also solve issue #73, which is most welcome, so thanks again.

braverock · 2022-03-27T12:52:31Z

This commit broke things, so reverting in #84 was the correct response for now.

It is actually very close to mergable.

I commented on the cleaning script file directly, but I will summarize my comments here to keep the discussion in the thread.

lines 7/8 of the cleaning script use absolute paths. this will break for anyone who doesn't have the code at C:/FA?FactorAnalytics. please change to relative paths from project root.

lines 102/103 rename column TickerLast to TickerL. This is a breaking change, as Justin noted above. My personal opinion is that making a breaking change to save three characters isn't necessary. Anyone familiar with the CRSP data set is used to the name TickerLast.

If we want to make breaking changes anyway, then the requirement for mergability is to refactor to make it non-breaking. There are ~18 places in the R/ directory that reference TickerLast, and about 14 places in the man/ directory. Refactoring would change the .R files first, and regenerate the relevant roxygen files in man/. If this changed all references to TickerLast and R CMD check runs without breaking on TickerLast, then I think this would be mergable.

lines 89 and 96 (reordering columns) will make future merges of vendor data break. there are already other breaking changes in the data (like changing the text names of Industry Sectors rather than changing labels for factor levels), so I don't consider this reordering to be a chnage that breaks future merges any more than other previous changes. it doesn't seem necessary, but there's no prevailing reason not to include it in the commit.

Sorry for the noise and the (minor) additional work, and thanks for contributing.

spinnj · 2022-03-27T13:31:47Z

Much appreciated. I will go through it and clean things up as per your recommendations and additionally deal with issue #85. Jon

…

________________________________ From: Brian G. Peterson ***@***.***> Sent: Sunday, March 27, 2022 9:52 AM To: braverock/FactorAnalytics ***@***.***> Cc: Spinney, Jon (Vestcor) ***@***.***>; Mention ***@***.***> Subject: Re: [braverock/FactorAnalytics] clean factorsSPGMI and stocksCRSP data sets (PR #83) WARNING: External Email! This commit broke things, so reverting in #84<#84> was the correct response for now. It is actually very close to mergable. I commented on the cleaning script file directly, but I will summarize my comments here to keep the discussion in the thread. lines 7/8 of the cleaning script use absolute paths. this will break for anyone who doesn't have the code at C:/FA?FactorAnalytics. please change to relative paths from project root. lines 102/103 rename column TickerLast to TickerL. This is a breaking change, as Justin noted above. My personal opinion is that making a breaking change to save three characters isn't necessary. Anyone familiar with the CRSP data set is used to the name TickerLast. If we want to make breaking changes anyway, then the requirement for mergability is to refactor to make it non-breaking. There are ~18 places in the R/ directory that reference TickerLast, and about 14 places in the man/ directory. Refactoring would change the .R files first, and regenerate the relevant roxygen files in man/. If this changed all references to TickerLast and R CMD check runs without breaking on TickerLast, then I think this would be mergable. lines 89 and 96 (reordering columns) will make future merges of vendor data break. there are already other breaking changes in the data (like changing the text names of Industry Sectors rather than changing labels for factor levels), so I don't consider this reordering to be a chnage that breaks future merges any more than other previous changes. it doesn't seem necessary, but there's no prevailing reason not to include it in the commit. Sorry for the noise and the (minor) additional work, and thanks for contributing. — Reply to this email directly, view it on GitHub<#83 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AEVGTQG2RVLEHKSVL3QNT7LVCBKZTANCNFSM5RVRHOUA>. You are receiving this because you were mentioned.Message ID: ***@***.***> Please note our name change to Vestcor Inc. and our new email domain is @Vestcor.org This e-mail message (including attachments) is confidential and may be privileged. Any unauthorized distribution or disclosure is prohibited. Disclosure to anyone other than the intended recipient does not constitute waiver of privilege. If you have received this e-mail in error, please notify us and delete it, including any attachments, from your computer system and records. --------------------------------------------------------------------------------------------------------- S'il-vous-plaît noter que nous sommes maintenant Vestcor Inc. et le nouveau domaine de courriel électronique est @Vestcor.org Ce courriel (y compris les pièces jointes) est confidentiel et peut être privilégié. La distribution ou la divulgation non autorisée de ce courriel est interdite. Sa divulgation à toute personne autre que son destinataire ne constitue pas une renonciation de privilège. Si vous avez reçu ce courriel par erreur, veuillez nous aviser et éliminer ce courriel, ainsi que les pièces jointes, de votre système informatique et de vos dossiers.

clean factorsSPGMI and stocksCRSP data sets

73054d3

braverock marked this pull request as ready for review March 26, 2022 14:54

braverock merged commit 2550984 into braverock:master Mar 26, 2022

spinnj mentioned this pull request Mar 26, 2022

Revert "clean factorsSPGMI and stocksCRSP data sets" #84

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clean factorsSPGMI and stocksCRSP data sets #83

clean factorsSPGMI and stocksCRSP data sets #83

spinnj commented Mar 25, 2022

JustinMShea commented Mar 26, 2022

spinnj commented Mar 26, 2022

braverock commented Mar 26, 2022

spinnj commented Mar 26, 2022

JustinMShea commented Mar 26, 2022

spinnj commented Mar 26, 2022

JustinMShea commented Mar 26, 2022

braverock commented Mar 27, 2022

spinnj commented Mar 27, 2022 via email

clean factorsSPGMI and stocksCRSP data sets #83

clean factorsSPGMI and stocksCRSP data sets #83

Conversation

spinnj commented Mar 25, 2022

JustinMShea commented Mar 26, 2022

spinnj commented Mar 26, 2022

braverock commented Mar 26, 2022

spinnj commented Mar 26, 2022

JustinMShea commented Mar 26, 2022

spinnj commented Mar 26, 2022

JustinMShea commented Mar 26, 2022

braverock commented Mar 27, 2022

spinnj commented Mar 27, 2022 via email